Introduction

This tutorial will walk you through the steps of a descriptive analysis using dplyr for data merge and cleaning.

At the end of each page you can click “Continue” to move to the next. The code given runs automatically, so you don’t have do run each block, although you can explore the code or objects in an interactive way if you like.

Looking Over at the Data

Two data frames have been loaded: schools and dists, pertaining to schools and districts respectively.

You probably notice a few things about the data frames. First they are both very big. You can see just how big with the dim function.

## [1] 99074    47
## [1] 18567    39

Renaming the Columns

Another thing you may have notices is that most of the column names are long and hard to read.

##  [1] "School.Name"                                                    
##  [2] "State.Name..Public.School..Latest.available.year"               
##  [3] "State.Abbr..Public.School..Latest.available.year"               
##  [4] "studyID"                                                        
##  [5] "instudy"                                                        
##  [6] "School.ID...NCES.Assigned..Public.School..Latest.available.year"
##  [7] "Agency.Name..Public.School..2013.14"                            
##  [8] "Agency.ID...NCES.Assigned..Public.School..Latest.available.year"
##  [9] "School.Type..Public.School..2013.14"                            
## [10] "Agency.Type..District..2013.14"

Retyping them all would take a long time. A quicker way is to cut the names off at the first instance of multiple periods (..). Often times this sort of text parsing will require regular expressions, but here we have an even simpler option using the strsplit function built into R. This function breaks one character string up into separate strings, around a specific character.

## [[1]]
## [1] "O"           "ce upo"      " a mid"      "ight dreary"

In order to apply this to the entire vector we can use the sapply with an anonymous function that runs strsplit and subsets the first string (everything before the first ..).

##  [1] "School.Name" "State.Name"  "State.Abbr"  "studyID"     "instudy"    
##  [6] "School.ID"   "Agency.Name" "Agency.ID"   "School.Type" "Agency.Type"

The fixed=T argument was necessary because otherwise the periods would be interpreted as regular expressions, giving strange results:

## [[1]]
##  [1] "" "" "" "" "" "" "" "" "" ""
## [[1]]
## [1] "Generic"     "Column.Name"

Subsetting

Depending on the work you want to do, its likely that you only need specific subset of the data. Here we will pull out High Schools specifically, and use only them going forward.

The dplyr package provides simple and very readable functions for subsetting and cleaning data, which will be very useful in this tutorial. We should start by loading the package

A subset is then saved as a separate variable with the filter function. The second argument is the condition we want to filter the schools by, here those that have the value of “3-High” in the School.Level.Code variable.

## [1] 99074    47
## [1] 20756    47

This has significantly decreased the size of the data we’re working with.

Merging

Because we also may want to know information about the school district that each school is in, we must merge the two data frames. Dplyr also provides convenient merging functions.

We have to decide which variables we will use to link the different records - in this case we are lucky in that the agency ID and names are given in the school data, as well as information about the state the school or district is in, which should also match between the two sets.

We will use the left_join function, as we want to find one row from the Districts data to match each row from the Schools. Therefore schools, listed first and therefore on the “left” side is the basis of the merger. As some variable names will be common between the two sets, we can use the suffic argument to append “.sch” or “.dist” to school or district variables, respectively.

## Warning: Column `Agency.Name` joining factors with different levels,
## coercing to character vector
## Warning: Column `Agency.Type` joining factors with different levels,
## coercing to character vector
## [1] 20756    47
## [1] 20756    81

Recoding

Binary Variables

Part of the data cleaning process involves recoding data into a more useful form. One of the simple things we can do is make variables with only two options into T/F. We can easily apply this conversion to multiple variables using Dplyr’s mutate_each function. The second argument, funs(. == "1-Yes"), takes each variable and evaluates it as equal to the value “1-Yes”, making it into a T/F logical variable.

## `mutate_each()` is deprecated.
## Use `mutate_all()`, `mutate_at()` or `mutate_if()` instead.
## To map `funs` over a selection of variables, use `mutate_at()`

Urbanicity

Looking at the Urbanicity variable, its categories may be too granular for certain purposes.

##  [1] "11-City: Large"      "12-City: Mid-size"   "13-City: Small"     
##  [4] "21-Suburb: Large"    "22-Suburb: Mid-size" "23-Suburb: Small"   
##  [7] "31-Town: Fringe"     "32-Town: Distant"    "33-Town: Remote"    
## [10] "41-Rural: Fringe"    "42-Rural: Distant"   "43-Rural: Remote"

We can also create an alternate version that combines the categories into four based on the overall urban category: City, Suburb, Town, and Rural.

Using Dplyr’s select function to look at only the variables of interest, we can review our newly recoded columns.

Summary Statistics

Summaries of variables can be useful in understanding their distributions. The stargazer package provides a summary statistics in an attractive format.

We will select a set of variables to examine

Statistic Mean St. Dev. Min Max
Male.Students.sch 368.61 374.90 0 3,197
Female.Students.sch 352.64 362.04 0 2,454
American.Indian.Alaska.Native.Students.sch 7.38 30.96 0 1,462
Asian.or.Asian.Pacific.Islander.Students.sch 36.38 120.97 0 3,309
Hispanic.Students.sch 164.04 357.76 0 4,264
Black.Students.sch 111.00 220.59 0 2,899
White.Students.sch 382.33 436.80 0 3,771
Hawaiian.Nat.sch 2.48 22.23 0 1,218
Total.Students.sch 721.18 734.56 0 5,457

This is a simple table, giving the mean, median, minimum and maximum values of the variables, as well as the standard deviation. We Can see from this data that on average there is a low number of Native American, Hawaiian, and Asian Students, more commonly Black or Hispanic, and on average the highest number of White Students in each school, however by the standard deviations we can also see that these numbers have great variance in different schools.

Graph

Please launch the ShinyPlot.Rmd file in Rstudio and compile it with the “Run Document” button.

However, here are some histograms that represent what can be seen in the interactive one:

Looking through these histograms, we can learn a lot about the data set. We can see that Large Suburbs appear to be the most common urbanicity group, though with the collapsed categories, rural areas in general are more common (towns are the least common - which may be a result of the urban-rural divide, although this is likely not the correct dataset to analyze that). We see that with the racial variables, there tend to be many schools with very a very low percentage of that race. It is interesting that this is the case across all the races, with the exception of white students, of which there tend to be either a very low or a very high percentage of.